Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add c14n for node and document #138

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Add c14n for node and document #138

wants to merge 1 commit into from

Conversation

saks
Copy link

@saks saks commented Sep 6, 2024

There are not too many choices when it comes to "Canonical XML" implementation for rust. This PR is a compilation of ideas, approaches and test files from the following projects:

Work items

  • add canonicalize method for Node and Document
  • add Node::ancestors method (I personally find it very useful when using "nokogiri", plus it help in tests a lot). Although I don't feel too strongly about, can remove if it doesn't fit.
  • add Node::at_xpath method. Yes, it is similar to findnodes (and can ultimately be implemented using Context), but with design from "nokogiri", I find it a lot more flexible when navigating a tree. Same as for the previous item, can update PR to remove it if necessary.

Any feedback is welcome! I don't feel particularly attached to any approach here. Please let me know how to make this PR better.

@dginev
Copy link
Member

dginev commented Sep 6, 2024

@saks interesting, there is a very high-level question I have first.

Why do you think this wrapper crate is the right place to introduce this method?
rust-libxml mostly tries to follow the libxml2 coverage, and probably should not grow much further, as a general-purpose XML utility library.

As you mention xml_c14n exists as a separate Rust crate. Maybe your work could also be packaged as a separate crate?

@tstenner
Copy link

c14n is a part of the standard libxml2 featureset, so it would be nice to have this functionality. This MR does a bit more, but from my reading of the readme it should be covered:

Only covers a subset of libxml2 at the moment, contributions are welcome. We try to increase support with each release.

I'd like to have this for some validation in tests where I compare the output of some rust code against a canonicalized result of an xpath query.
It's doable with this crate and xml_c14n, but having it as part of rust-libxml would remove the need to serialize the libxml document and then parse it as a libxml document directly afterwards.

@dginev
Copy link
Member

dginev commented Dec 23, 2024

Having independent intrerest is certainly helpful, as is pointing out that c14n can be fully thought of as exposing libxml2 functionality. So I am feeling a lot more inclined to merge the PR - I will make some time to review it in more depth soon.

Would you like to also add a review for this PR @tstenner ? Given you are an active user of the feature, you may have good visibility of any edges left to iron out.

Copy link

@tstenner tstenner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments regarding the c14n function parameters below. Other than that the API looks easy to use.

I didn't find anything wrong with the implementation. The output buffer might be shorter, but I'm not as knowledgable in this area.

The test coverage mirrors the "official" test cases and is quite extensive.

}

/// Canonicalization specification to use
#[derive(Copy, Clone, Eq, PartialEq, Ord, PartialOrd, Hash, Debug, Default)]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had implemented this as follows:

/// Canonicalization mode for [`Document.c14n`]
pub enum C14NMode {
  /// [XML_C14N_1_0](https://www.w3.org/TR/2001/REC-xml-c14n-20010315)
  Mode1_0,
  /// [XML_C14N_1_1](https://www.w3.org/TR/xml-c14n11/)
  Mode1_1,
  /// [XML_C14N_EXCLUSIVE_1_0](https://www.w3.org/TR/xml-exc-c14n/)
  ModeExclusive1_0(Vec<CString>),
}
impl C14NMode {
  fn as_c_int(&self) -> c_int {
    match self {
      C14NMode::Mode1_0 => xmlC14NMode_XML_C14N_1_0 as c_int,
      C14NMode::Mode1_1 => xmlC14NMode_XML_C14N_1_1 as c_int,
      C14NMode::ModeExclusive1_0(_) => xmlC14NMode_XML_C14N_EXCLUSIVE_1_0 as c_int,
    }
  }
}

It doesn't map the C API 1:1, but makes it clear the inclusive_ns_prefixes only apply to the exclusive c14n mode

}
}

unsafe fn c_obuf_into_output(c_obuf: xmlOutputBufferPtr) -> String {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first approach looked like this:

let buffer = xmlBufferCreate();
let c14n_res = xmlC14NExecute(,
        xmlOutputBufferCreateBuffer(buffer, ptr::null_mut()),
      );
let result = xmlBufferContent(buffer);
let c_string = CStr::from_ptr(result as *const c_char);
let node_string = c_string.to_string_lossy().
xmlBufferFree(
Ok(node_string)

It's shorter and fewer additional functions, but I'm probably missing some edge cases / needed error handling.

}

/// Create a [Vec] of null-terminated [*mut xmlChar] strings
fn to_xml_string_vec(vec: Vec<String>) -> Vec<*mut xmlChar> {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on the audience it might be feasible to require the inclusive namespaces to be a Vec<CString>. After all, the namespaces to keep will most likely be hardcoded anyways.


/// Options for configuring how to canonicalize XML
#[derive(Clone, Eq, PartialEq, Ord, PartialOrd, Hash, Debug, Default)]
pub struct CanonicalizationOptions {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With inclusive_ns_prefixes as part of the CanonicalizationMode (see below), there are only two options left. For those, I'd rather have them as arguments to the functions instead of having an options struct.

pub fn canonicalize(
&self,
options: CanonicalizationOptions,
callback: Option<(xmlNodePtr, xmlC14NIsVisibleCallback)>,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had implemented the callback functionality as follows:

type IsVisibleCallback = Box<dyn Fn(&RoNode, &RoNode) -> bool>;

// thin pointer wrapper that calls supplied the callback instead
unsafe extern "C" fn _is_visible_wrapper(
  data: *mut c_void,
  node: xmlNodePtr,
  parent: xmlNodePtr,
) -> c_int {
  let callback = unsafe { &mut *(data as *mut IsVisibleCallback) };
 // handling of parent nodes etc.
 // …
 (callback)(&RoNode(node), &RoNode(parent))) as c_int
}

impl Document{
  /// Canonicalizes the XML document according to the W3C XML Canonicalization specification.
  ///
  /// This method produces a canonical form of the XML document, which is useful for digital signatures
  /// and document comparison. The canonicalization process ensures consistent representation of the XML content.
    pub fn c14n(
    &self,
    mode: C14NMode,
    with_comments: bool,
  ) -> Result<String, ()> {
    self.c14n_with_visibility_callback(None, mode, with_comments)
  }

  /// Canonicalizes the document with an optional visibility callback
  ///
  /// `is_visible_callback(node: &RoNode, parent: &RoNode)` is called for every
  /// node having a parent, returning true if the node should be included in the
  /// canonicalized output.
  pub fn c14n_with_visibility_callback(
    &self,
    is_visible_callback: Option<IsVisibleCallback>,
    mode: C14NMode,
    with_comments: bool,
  ) -> Result<String, ()> {
      // boxes the callback so it can be passed as a void pointer to [`_is_visible_wrapper`]
      let (is_visible_fn, mut user_data) = match is_visible_callback {
        Some(f) => (
          Some(_is_visible_wrapper as unsafe extern "C" fn(_, _, _) -> _),
          Some(Box::into_raw(f)),
        ),
        None => (None, None),
      };
      let c14n_res = xmlC14NExecute(
        self.doc_ptr(),
        is_visible_fn,
        user_data
          .as_mut()
          .map(|s| ptr::from_mut(s))
          .unwrap_or(ptr::null_mut()) as *mut c_void,
        mode.as_c_int(),
        inclusive_ns_prefixes,
        with_comments as c_int,
        xmlOutputBufferCreateBuffer(buffer, ptr::null_mut()),
      );
      // …
  }
}

Usage looks like this:

let input = r#"<ns1:root><ns2:foo x="1"  a="2"/><!--cmt--><a/><b/></ns1:root>"#;
let callback = |_node: &RoNode, _parent: &RoNode| {
    !(_parent.get_name() == "ns1:root" && _node.get_name() == "a")
  };
  let c14n_result = doc.c14n_with_visibility_callback(
    Some(Box::new(callback)),
    libxml::tree::document::C14NMode::Mode1_1,
    false,
  );

It's a lot more flexible, (I assume) a lot more complex once the ancestor nodes are handled probably as in your PR

};

impl Document {
/// Canonicalize a document and return the results.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This docstring could use a more information or a link to the libxml2 documentation (which isn't that great either). The callback parameter is obvious and could benefit from an examples

@tstenner
Copy link

@dginev Thanks for reconsidering it. Overall, I only had one obvious improvement to suggest and an opinion regarding the callback parameter (but then again, I don't use it, so it doesn't matter to me that much). The output buffer handling could be shorter, but that is something you know more about than me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants